feat: index numerical and date fields in Solr with appropriate types + more targeted search result highlighting #10887

Open: vera wants to merge 2 commits into IQSS:develop from vera:feat/solr-field-types

@vera vera commented Sep 27, 2024

What this PR does / why we need it:

Currently, all fields regardless of type are indexed in Solr as English text (text_en). With this PR, numerical and date fields are indexed in Solr with appropriate types:

| Field type defined in TSV | Field type indexed in Solr |
| --- | --- |
| int | plong |
| float | pdouble |
| date | date_range (solr.DateRangeField) |
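
As a rough illustration, the mapping in the table can be written as a lookup. This is a minimal sketch, not the code in this PR: the names `SOLR_TYPE_BY_TSV_TYPE` and `solr_field_type` are hypothetical, and the `text_en` fallback for all other field types is an assumption based on the description above.

```python
# Mapping copied from the table in this PR description.
SOLR_TYPE_BY_TSV_TYPE = {
    "int": "plong",
    "float": "pdouble",
    "date": "date_range",
}


def solr_field_type(tsv_field_type: str) -> str:
    """Return the Solr field type for a TSV field type (hypothetical helper).

    Assumption: every type not listed in the table keeps the previous
    behaviour and is indexed as English text (text_en).
    """
    return SOLR_TYPE_BY_TSV_TYPE.get(tsv_field_type, "text_en")
```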

I chose to index dates as DateRangeField because it can represent dates at any precision, e.g. a day (YYYY-MM-DD), a month (YYYY-MM) or a year (YYYY). See "Date Formatting and Date Math" in the Apache Solr Reference Guide.

This matches the allowed formats in a date field as defined by Dataverse.
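
To make the three precisions concrete, here is a small sketch that classifies a date string by its precision. It is only an illustration, not Dataverse's validation logic: `DATE_PATTERN` and `date_precision` are hypothetical names, and the pattern checks the shape of the value only (it does not reject a month of 13, for example).

```python
import re

# Shape check for YYYY, YYYY-MM and YYYY-MM-DD (the precisions named above).
DATE_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def date_precision(value: str):
    """Return "day", "month" or "year", or None if the shape does not match."""
    match = DATE_PATTERN.fullmatch(value)
    if match is None:
        return None
    _year, month, day = match.groups()
    if day is not None:
        return "day"
    if month is not None:
        return "month"
    return "year"
```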

This means that range queries are now possible on numerical and date fields, e.g. exampleIntegerField:[25 TO 50] or exampleDateField:[2000-11-01 TO 2014-12-01].
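
The range syntax can be sketched with a small helper that builds such clauses. The function name `range_clause` is hypothetical; the field names are the example fields from this PR. Solr's range syntax also accepts `*` for an open bound, which the sketch uses when a bound is omitted.

```python
def range_clause(field, lower=None, upper=None):
    """Build an inclusive Solr range clause, e.g. field:[25 TO 50].

    A missing bound is written as "*", which Solr treats as open-ended.
    """
    lo = "*" if lower is None else str(lower)
    hi = "*" if upper is None else str(upper)
    return f"{field}:[{lo} TO {hi}]"
```

For example, `range_clause("exampleIntegerField", 25)` yields a clause matching all values from 25 upwards.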

Which issue(s) this PR closes:

This PR implements range queries as discussed in #370 (that issue is already closed).

This PR is related to #8813 and IQSS/dataverse-frontend#278 (the range queries that are now possible lay the groundwork for a nicer search facet UI).

Special notes for your reviewer:

For testing, I've created a sample TSV containing all relevant fields here.

Suggestions on how to test this:

  1. Load the sample TSV, then update and reload the Solr schema as described in the docs
  2. In the UI:
    1. Activate the metadata block
    2. Activate facets for all three fields
    3. Create a dataset with values in all three fields
  3. Run test range queries via the search bar, e.g. exampleIntegerField:[25 TO 50] or exampleDateField:[2000-11-01 TO 2014-12-01]
  4. Check that facets are working correctly
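
As an alternative to typing the queries from step 3 into the search bar, the same queries could be sent through the Dataverse Search API. The sketch below only builds the request URLs. It assumes a local installation at `http://localhost:8080` and assumes the API accepts the same query syntax as the search bar; `search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# The two example range queries from step 3.
TEST_QUERIES = [
    "exampleIntegerField:[25 TO 50]",
    "exampleDateField:[2000-11-01 TO 2014-12-01]",
]


def search_url(base_url, query):
    """Return a Search API URL with the query URL-encoded in the q parameter."""
    return f"{base_url.rstrip('/')}/api/search?{urlencode({'q': query})}"


# Assumption: a local test installation listening on port 8080.
urls = [search_url("http://localhost:8080", q) for q in TEST_QUERIES]
```

Each URL in `urls` can then be opened in a browser or passed to an HTTP client.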

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Facets still look the same as before. The only change is a small one to the highlighting of search results; see my comment below.

Is there a release notes update needed for this change?:

Yes, there should be an info text describing the new feature + instructions for how to activate the feature:

  • the Solr schema.xml needs to be updated
  • all datasets need to be reindexed

Additional documentation:

/

coveralls commented Sep 27, 2024

Coverage Status

coverage: 20.873% (+0.001%) from 20.872%
when pulling 3f5919b on vera:feat/solr-field-types
into 050064e on IQSS:develop.

vera commented Sep 27, 2024

Additionally, I've set hl.requireFieldMatch to true:

If false, all query terms will be highlighted for each field to be highlighted (hl.fl) no matter what fields the parsed query refer to. If set to true, only query terms aligning with the field being highlighted will in turn be highlighted.

https://solr.apache.org/guide/solr/latest/query-guide/highlighting.html
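
For illustration, a Solr request with this setting might carry parameters like the ones below. This is a sketch, not the code in this PR: `highlight_params` is a hypothetical helper and the field names in the usage note are placeholders. The parameter names `hl`, `hl.fl` and `hl.requireFieldMatch` are Solr's own.

```python
def highlight_params(query, highlight_fields):
    """Return Solr query parameters with field-matched highlighting enabled."""
    return {
        "q": query,
        "hl": "true",
        # hl.fl takes a comma-delimited list of fields to highlight.
        "hl.fl": ",".join(highlight_fields),
        # Only highlight a term in a field if the query targeted that field.
        "hl.requireFieldMatch": "true",
    }
```

For example, `highlight_params("description:replication", ["title", "description"])` asks Solr to highlight both fields but only mark matches in the field the query actually refers to.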

Two reasons:

  1. Querying Solr with a date range query while highlighting is enabled, using the default (unified) highlighter without requireFieldMatch, triggers a 500 error in Solr (see my post on the Solr mailing list). My guess is that Solr attempts to highlight the matched date range within fields in a nonsensical way, which triggers the error.
  2. I think this improves the highlighting of search results. Previously, a match for my search term was highlighted anywhere, even if I limited my query to a specific field. For example, here "replication" is also highlighted in the title even though I limited my search to the description:

image

With this change, highlighting is limited to the queried fields when the query targets specific fields:

image
image
image
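
The difference in behaviour can be summarised with a toy model. This is not how Solr implements highlighting; it only mirrors the documented rule quoted above, and `highlighted_terms` and the field names are hypothetical.

```python
def highlighted_terms(query_terms_by_field, field, require_field_match):
    """Toy model: which query terms are eligible for highlighting in `field`.

    query_terms_by_field maps each field named in the query to its terms.
    """
    if require_field_match:
        # Only terms that the query directed at this field.
        return set(query_terms_by_field.get(field, ()))
    # Otherwise every query term is eligible, whatever field it targeted.
    all_terms = set()
    for terms in query_terms_by_field.values():
        all_terms.update(terms)
    return all_terms
```

With a query limited to the description, the model highlights "replication" in the title only when `require_field_match` is false, which matches the before and after screenshots.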

@pdurbin pdurbin added the Size: 3 A percentage of a sprint. 2.1 hours. label Sep 27, 2024
@vera vera changed the title feat: index numerical and date fields in Solr with appropriate types feat: index numerical and date fields in Solr with appropriate types + more targeted search result highlighting Sep 27, 2024